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(57) ABSTRACT 

A database management system includes an execution 
engine that, upon rollback of an aborted transaction, has the 
ability to set fields of the rows that are updated or deleted by 
the transaction prior to aborting. In particular, when a select, 
update or delete statement includes a "set on rollback" 
clause, the respective table access operator of the execution 
engine modifies the "before image" of each row that is 
stored by the transaction log manager. The modified before 
image includes the field updates specified by the set on 
rollback clause. If a transaction associated with the state- 
ment aborts, when the transaction log manager restores the 
before images of each row that was deleted or updated by the 
transaction, the restored before images include the field 
updates specified by the set on rollback clause. 

6 Claims, 14 Drawing Sheets 
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DATABASE MANAGEMENT SYSTEM AND 
METHOD FOR UPDATING SPECIFIED 
TUPLE FIELDS UPON TRANSACTION 
ROLLBACK 

RELATED APPLICATIONS 

This application is related to the following applications: 
"System and Method for Performing Database Operation on 
a Continuous Stream of Tuples," Ser. No. 09/347,997; 
"Database Management System and Method For Accessing 
Rows in a Partitioned Table," Ser. No. 09/347,998; "Data- 
base Management System And Method For Dequeuing 
Rows Published to a Database Table," Ser. No. 09/347,996; 
"System And Method For Performing Database Operations 
and for Skipping over Tuples Locked in an Incompatible 
Mode," Ser. No. 09/347,995. 

FIELD OF INVENTION 

The present invention relates generally to a relational 
database system that has been extended to perform opera- 
tions on a continuous stream of tuples, and particularly to a 
system and method for returning to a calling application 
rows deleted from a database table. 

BACKGROUND OF THE INVENTION 

Traditional relational database systems receive com- 
mands falling into two categories: data definition commands 
(DDL) and data manipulation commands (DML). Data 
manipulation commands can be categorized as either queries 
that read data in the database or update statements that 
insert, delete or update data in the database base. Traditional 
database systems are designed and optimized for those 
operations. These systems are not designed and optimized 
for allowing applications to register to receive notifications 
when new data becomes available or existing data is 
updated. 

However, receiving such notifications is essential for 
many applications. The need for such services has caused 
most transaction service vendors to extend their systems by 
including separate, special purpose resource managers for 
queue management and publish/subscribe services. These 
resource managers allow applications to enqueue and 
dequeue requests for information. Publish/subscribe ser- 
vices allow applications to publish and subscribe to notifi- 
cations. 

In such systems, transactions access both an SQL data- 
base system (which may be considered to be a type of 
resource manager) as well as resource managers for queuing 
and/or publish/subscribe services. As a result, the SQL 
database system and the other resource managers have to 
participate together in an expensive two-phase commit pro- 
tocol. Due to the lack of integration between the resource 
managers and the database system, the SQL compiler cannot 
optimize access to both the data stored by the queuing or 
publish/subscribe resource manager and the data stored in 
the database, and cannot perform joins across the different 
data sources. 

To avoid subjecting applications to the two-phase commit 
protocol, some SQL database vendors (e.g., Oracle and 
Sybase) have integrated transactional queuing and publish/ 
subscribe services into their database products. While their 
implementations remove the need for a two-phase commit 
protocol, these implementations use special purpose objects 
for queues and publication channels. These implementations 
prevent queues and publication channels from being 
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accessed as part of SQL statements, unless the user is willing 
to forgo important characteristic properties of queue or 
notification channels, particularly suspending execution 
when all qualifying tuples have been returned and automati- 
5 cally resuming execution when new tuples become avail- 
able. It also effectively prevents the SQL compiler in these 
systems from optimizing access to notifications and SQL 
data. 

The present invention provides transactional queuing and 

10 publish/subscribe extensions to an SQL database infrastruc- 
ture. These extensions do not introduce any special objects. 
Applications access regular SQL database tables, including 
tables used as queues or publication channels. Applications 
use SQL select statements to subscribe and/or dequeue 

15 notifications. Furthermore, applications use SQL insert and 
update statements to publish notifications. These extensions 
remove the need for a two-phase commit protocol, allow 
applications to perform join operations on data from differ- 
ent sources, and allow the SQL compiler to optimize access 

20 to both notifications and "normal" SQL data, using previ- 
ously developed query optimization techniques. Further, 
execution of an SQL statement that accesses a data stream 
is blocked when all qualifying tuples have been processed 
and then resumes once new notifications become available. 

25 As indicated above, the present invention extends tradi- 
tional relational database systems to perform operations on 
a continuous stream of tuples, while retaining the traditional 
benefits of such systems, such as set oriented access using 
relational operators, transactional protection, and so on. The 
stream of tuples may represent queue entries, or notifica- 
tions. 

A simple example will assist in understanding the differ- 
ence between traditional and streaming queries. A traditional 
35 query for viewing or otherwise processing records relating 
to sales by a particular salesperson might be: 
select * from Saleslnfo where salesperson=»"Johnson, 
Mary". 

This query would retrieve all records from the table 
40 named "Saleslnfo" having a value of "Johnson, Mary" in the 
salesperson field (column). Only records in the table at the 
time the query is submitted will be reported by this query. 

However, if the user wants the system to continuously 
monitor new sales entries for a particular salesperson, the 
4 5 user of a traditional relational database system cannot define 
a simple SQL statement to perform that function, since 
monitoring features are not an integrated part of any of the 
widely used "dialects" of SQL. In contrast to that, the 
present invention allows users to issue an SQL select state- 
5 o ment that performs this function: 

select * from stream(SalesInfo) where salesperson- 

"Johnson, Mary"; 
Using the present invention, when an SQL statement 
specifies stream access, the execution of the SQL statement 
55 never returns an "end of data" indicator. Rather, when all 
qualifying tuples in the specified table have been processed, 
continued execution is blocked, and then resumes when 
more qualifying tuples become available. 

60 SUMMARY OF THE INVENTION 

A database management system includes an execution 
engine that, upon rollback of an aborted transaction, has the 
ability to set fields of the rows that are updated or deleted by 
the transaction prior to aborting. In particular, when a select, 
65 update or delete statement includes a "set on rollback" 
clause, the respective table access operator of the execution 
engine modifies the "before image" of each row that is 
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stored by the transaction log manager. The modified before to have the database management system modify rows in a 

image includes the field updates specified by the set on table before the rows are returned to the application, as a 

rollback clause. If a transaction associated with the state- integrated operation. 

ment aborts, when the transaction log manager restores the Another feature of the present invention is the use of a fan 

before images of each row that was deleted or updated by the 5 out operator and a set of partition scan operators in an 

transaction, the restored before images include the field execution engine to access a table that has been partitioned, 

updates specified by the set on rollback clause. when the table to be accessed by a statement is partitioned, 

In another aspect of the present invention, the SQL a respective partition scan operator is used to access rows of 

compiler and SQL executor of a relational database system each respective database table partition. A fan out operator 

are extended to process operations on streams of tuples and 30 receives requests from a calling application to access rows 
to access regular database tables as continuous streams of from table, and responds by sending requests for rows to the 

tuples. In particular, a new table access method provides partition scan operators. It receives rows of the table from 

"stream access" to a specified table. When using stream the partition scan operators and sends the received rows of 

access, the SQL executor first reads all qualifying tuples in the database table to another node in the execution tree. Each 

a specified table, and subsequently monitors for and returns 15 of the partition scan operators responds to the requests for 

new qualifying tuples being added to the table. The first part rows by returning to the fan out operator qualifying rows, if 

of the method is performed by a regular table scan, while the any, from the corresponding table partition. If the statement 

second part of the method is performed by a so-called delta being executed includes an embedded delete or update 

scan. The monitoring function is performed until the cursor operation, the qualifying rows are deleted or updated at the 

representing the SQL statement being executed, including 20 same time that the qualifying rows are returned to the fan out 

the scan operations, is closed by the calling application. operator. The transaction associated with the delete or 

The stream access mode of operation causes execution of update operation is not allowed to commit until all rows 

an SQL statement to block, and thus be suspended, when deleted or updated by the partition scan operators have been 

there no qualifying tuples to be returned to the calling t0 tDe calling application. When the statement being 

application. Execution of the SQL statement automatically 25 executed is a streaming mode query, the fan out operator 

resumes (actually, is rescheduled) when new data becomes non-blocking requests to the partition scan operators 

available. so that when one access to one partition is blocked, rows 

A set of data structures are provided to keep track of from anotner P^ 1 ™ arc processed, 

active table access operators associated with active state- 3Q BRIEF DESCRIPTION OF THE DRAWINGS 
ments (whose execution has not yet terminated). The data 

structures keep track of the status of each active table access Additional objects and features of the invention will be 

(scan) operator. In particular, a session control block is more readilv a PP arent ^om the following detailed descrip- 

expanded to include fields for keeping track of whether a tl0n and a PP e nded claims when taken in conjunction with 

scan is in the initial "regular" scan phase, or is in the delta 35 the drawings m whicn: 

scan phase of a scan operation. The session control block FIG. 1 is a block diagram of a transaction processing 

also includes a "delta scan list" of new and modified rows to system having an SQL compiler. 

be processed in the delta scan phase. When any transaction FIG. 2 depicts an execution tree generated by an SQL 

terminates, and the terminated transaction updated the delta compiler. 

scan list, the delta scan operation is automatically resched- ^ FIG. 3 depicts a system in which a database table is 

uled for execution to process the rows in the delta scan list partitioned and stored on multiple nodes of the system, 

that were added or modified by the terminated transaction. Fia 4 depicts request m6 response queues> ^ lajam 

Another feature of the present invention is a "skip con- as pre-fetch buffers, between tasks or operators in a query 

flict" access mode of operation. Normally, a table access execution tree. 

operator is blocked when it attempts to access a row that is 45 rg. 5 depicts an application execution time line showing 

locked in a conflicting mode, such as a write lock mode, for an application that executes a sequence of distinct 

which is inconsistent with the read or write access required transactions during execution of a single query. 

by the table access operator. Generally, the lock in the mr- a a^\^ « a c 

. fl . , i_ t !• , * , FIG. 6 depicts one node of a transaction processing 

conflicting mode will be held by a transaction other than the svstem 

transaction associated with the operator in question. When 50 * , . . . t 

the SQL statement being executed uses the new "skip c ™ depicts the relationship between an SQL compiler, 

conflict" syntax provided by the present invention, the SQL executor and a user process. 

operator skips over rows that are locked in a conflicting FIG - 8 depicts a control block data structure. 

mode, which would otherwise cause the operator to suspend FIGS. 9A, 9B, 9C, 9D, 9E and 9F represent control 

operation. Furthermore, if the operator is operating in 55 blocks, lists and queues used by the SQL executor to control 

streaming mode, a key that identifies such skipped rows is aQ d keep track of the status of table access operators being 

added to the operator's list of rows to be processed during executed by a disk process. 

the delta scan phase. FIGS. 10A and 10B depict flow charts of a table scan 

Another feature of the present invention is an execution procedure and a delta scan procedure, respectively, 

engine for executing select statements that have an embed- 60 FIG. 11 is a flow chart of an insert/update operation 

ded delete or update operation. A select statement with an procedure executed by a disk process for a table access 

embedded delete operation is used by applications to (scan) operator. 

dequeue data from a table. A dequeue statement in an FIG. 12 is a flow chart of a portion of a procedure 

application program is converted into a select statement performed when a transaction commits, 

(query) that uses the stream, skip conflict and embedded 65 FIG. 13 is a flow chart of a portion of an abort transaction 

delete features of the present invention. A select statement process performed by a disk process procedure for a table 

with an embedded update operation enables an application access (scan) operator. 
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FIG. 14 is a flow chart of a file scan procedure having node (e.g., 74-2) by a request queue 70 and a fetched records 

logic for handling Skip Conflict mode and streaming access queue 72. The request queue 70 stores requests being 

file scans. conveyed from the parent node to its child node, while the 

FIG. 15 depicts the use of a fan out operator node in a fetched records queue 72 conveys data and return codes 

query execution tree for automatically routing requests for 5 ( e -S*« an enc * °f fife or eQ d of scan code) being returned to the 

tuples to the partition access (scan) operators handling the parent node in response to the requests, 

various different partitions of a table. Referring to FIG. 5, while most aspects of the present 

FIG. 16 is a flow chart for executing a stream access scan invention concern the integration of streaming access into an 

with an embedded delete operation on a non-partitioned S Q L environment, some aspects of the invention involve 

t aD i e _ 30 transaction processing. Generally, a relational database 

FIGS. 17Aand 17B are a flow chart for executing a stream in ^ ludes ? management subsystem, which pro- 
access scan with an embedded delete operation on a parti- V1 f 65 va ™f services to apphcattons in conjunction with the 
tioned table relational database. While many transactions mclude the 
4 _ " . . , execution of a SQL statement, transactions and SQL state- 

FIG. 18 depicts a transaction log manager that has been 15 m6nt execution ^ not tne ^ thin To hel lain the 

modified to support a set on rollback- SQL syntax exten- dislincliollj reference ^ made to FIG 5 which shows the 

S10n * execution time line for an exemplary application. As shown, 

DESCRIPTION OF THE PREFERRED a PPjjf tion starts . a transaction (TxIDl), opens a 

EMBODIMENTS cursor (begins execution of a query), performs one or more 

20 fetches using the cursor, and then ends the first transaction. 

Conventional Relational Database Query Processing firet »«^tion ends by committing the transaction, all 

changes made to the database by the transaction are made 

FIGS. 1-5 show aspects of conventional relational data- durable and locks to the corresponding tuples that are held 

base query processing, and conventional transaction pro- by the transaction are released. 

cessing. These are also applicable to the preferred embodi- 25 Iq this example the application has not closed the cursor 

ments of the present invention. even though the first transaction has ended. Rather, the 

Referring to FIG. 1, in a traditional relational database application starts a second transaction, fetches more data 

environment, an application program 50 submits a query to using the open cursor, and then ends the second transaction, 

an SQL (structured query language) compiler 52. The SQL Generally, it is possible for an application to execute an 

compiler converts the query into an SQL plan or execution 30 unlimited number of transactions using a single open 

tree 54 that is executed by an SQL executor (not shown). The cursor — corresponding to a single query (SQL statement), 

compiled SQL plan stores and/or retrieves data from a Thus, query execution and transaction execution are distinct 

database 56. As shown in FIG. 2, the leaf nodes 57 of the operations that often overlap, but which are logically dis- 

execution tree are the ones that actually interact with the tinct. From another viewpoint, while many SQL statements 

database 56. Leaf nodes 57 are often called scan operators 35 are, in fact, executed completely within the context of a 

or table access operators. single transaction, the transaction and statement execution 

In this document, the term "table access operators" will be mechanisms are distinct and allow much more flexible 
used to collectively refer to scan operators, which are used arrangements. In the context of the present invention, when 
to read data from tables, as well as update, delete and insert streaming access to database tables, applications may 
operators, which are used to modify the contents of tables. use a sequence of transactions while executing long running 
However, sometimes the term "scan operator" will be used streaming access queries, 
synonymously with "table access operator." Many of the Computer System 
same control structures and system control procedures that _ c , „ . _ 
are applicable to scan operators are also applicable to other , c Before des ^ m S ^ operation of the invention, a corn- 
types of table access operators. 5 P uter s y stem 100 m wmch an embodiment of the invention 

„ r . A . . - , , is implemented will be described with reference to FIG. 6. 

Referring to FIG. 3, tables m the database 56 may be ^ e ^mouter system 100 includes: 

partitioned, with various partitions 58 being stored on dif- , . . . „„„ 

ferent nodes of the relational database system. Such parti- one or more data P rocess *g units ( CPU ' S ) ™ 2 > 

tioning is often used for extremely large tables. Also, various 50 memorv 104 > whlch ™& typically mclude both high speed 

tables within a database may be stored on different nodes of random access memory as well as non-volatile 

the system. Such distributed storage facilitates efficient, memory; 

parallel processing of queries, by distributing both the disk a uscr interface 106; 

I/O and computational burden over multiple nodes. In FIG. a network or other communication interface 108 for 
3, the "application process" 60 represents the process or 55 communicating with other computers; and 
processes that execute not only the application program, but one or more communication busses 110 for interconnect- 
also the portions of the execution tree 54 above the leaf ing the CPU(s) 102, memory 104, user interface 106 
nodes. The leaf nodes are executed by disk processes 62 in and network interface 108. 

each of the nodes of the transaction processing system. The computer system's memory 104 stores procedures 

While FIG. 3 shows one disk process 62 for each node, the 60 and data, typically including: 

number of disk processes per node may vary from one an operating system 112 for providing basic system 

implementation to another. For instance, a separate disk services; 

process may be used for each logical disk volume. a file system 114, which may be part of the operating 

Referring to FIGS. 2 and 4, data flows between the nodes system; 

of the execution tree 54 may be handled by the use of a pair 65 application programs 116, such as programs executed in 

of queues 70, 72 between parent and child nodes 74. In user address spaces on behalf of users of the system 

particular each parent node (e.g., 74-1) is coupled to a child 100; 
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an SQL compiler 118, for compiling SQL statements (or an Access Privilege module 148 for controlling access to 

more generally, database queries), and portions of SQL the various tables in the database. 

statements; The access privilege module 148 limits access to infor- 

an SQL executor 120 for executing SQL statements; mation in each table on a user by user basis. Furthermore, 

database tables 121, for storing data; and 5 ma y be granted read access to tables, but not 

a transaction management facility 122 that manages the delet f access ' ^ other ™re privileged or trusted users 

execution of transactions. are also S ranted ddete access ' The use of an access privilege 

The SQL compiler 118 may include: ^St^ database systems is well known to those skilled 

conventional SQL compiler logic 123 for compiling SQL 1Q * R ^ tQ mQ ? ^ SQL fler m ^ fof 

statements usmg previous defined (conventional) SQL ^ SQL executQr 12Q a ^ Qf ^ ^ ^ 

syn ax ' which the SQL executor 120 can build an execution tree 138. 

streaming access compiler logic 124, which modifies the defin i lion blocks 150 are loaded by the SQL executor 

conventional compiler logic to handle streaming access 120 and used to create control blocks (170, FIG. 8) arranged 

SQL statements, for example SQL statements usmg the 15 mto a graph ^Med an execution tree 138. The execution of 

new extended SQL syntax of the preferred embodi- SQL statements is thereafter accomplished by interpreting 

menl; these control blocks. In a preferred embodiment, the SQL 

embedded delete and embedded update compiler logic compiler 118 does not produce any binary machine instruc- 

125 for handling delete and update operations embed- tion code; instead it produces definition blocks 150 that the 

ded in SQL select statements, for example SQL state- 20 SQL executor 120 interprets. 

ments using the new extended SQL syntax of the The execution tree 138 contains all the components 

preferred embodiment; necessary for executing an SQL statement, including integ- 

skip conflict compiler logic 126, which modifies the rity checks, firing triggers, and the body of an invoked SQL 

convention compiler logic to handle skip conflict stored procedure. 

access to a table, using either regular or streaming 25 Referring to FIG. 8, each task control block 170 generated 

mode access; by the SQL executor includes a pointer 171 to a definition 

partition fan out compiler logic 127 for which modifies block 150, state information 172 about computational state 

the conventional compiler logic to handle embedded of the associated task, and a set of parent 173, child 174 and 

update and delete operations, for example SQL state- sibling 175 control block pointers, which are used to form 

ments using the new extended SQL syntax of the 30 the directed graph or tree structure, an example of which is 

preferred embodiment, when an update or delete opera- shown in FIG. 2. 

tion is applied to a partitioned table; and e , . 0 ~ T rt 

nil 1 f 1 - mo . * . , Streaming SQL Query 
set on rollback compiler logic 128, which modifies the 

conventional compiler logic to handle "set on rollback" ^ example of a streaming SQL query is: 

operations specified in SQL statements, for example 35 select * from stream (Saleslnfo) where salesperson= 

SQL statements using the new extended SQL syntax of "Johnson, Mary". 

the preferred embodiment. The "stream" keyword is used to indicate that the execu- 

The SQL executor 120, sometimes called the executor, u °n of the query is to continue even after the end of the table 

includes: is reached, i.e., all qualifying tuples have been returned. In 

a scheduler 130 for scheduling the execution of tasks in an 40 particular, when executing this query the SQL executor first 

execution tree 138 (also called a task tree); the sched- processes all the entries in the Saleslnfo table that meet the 

uler determines the order and frequency with which the "where" condition. Then it processes all new entries inserted 

tasks are executed; mto tne Saleslnfo table as they are added to the table. The 

a set of interprocess communication methods 132 for next P ortioD of mis document will describe the data struc- 

handling the communication of requests and replies 45 mres and procedural mechanisms used to in a preferred 

between tasks that are in different processes (i.e., in embodiment to perform (i.e., execute) a streaming query, 

different address spaces); the interprocess communica- FIGS * 9A > 9B ' 9C aad 9D show a of dala structures 

tion methods may be implemented as part of the }iscd t0 kee P lr . ack of the status of the var ious table access 

operating system 112; operators running on a node of a database management 

a set of task methods 136, which are the basic functions 50 5? / ? MS) * ^ data structures iaclude ®* control 

or procedures that can be performed by each of the £ ocks ?1°> faction control blocks 142, session control 

tasks in an execution task tree; the present invention b 0cks 144 ' a Pf Ita ^ aD waiters 1151 260 ' ^ various schcd ~ 

adds new tasks to this set of tasks; and ule f r ^ eues 270 > 2 ?- 

an execution tree 138, which may include a number of „ 'V ™ lCXt ° f .4" P™ ni d ^° n > a "le * ^ble 

CT1 L lroor iUnt n a , t , - J 4 55 partition. When a table is not partitioned, the entire table is 

subtrees that are located m separate processes. , . . t C1 r r , „, . ' . 

Tv n cru ~„ ™ ♦ 1 c n • 1 j r stored in a single nle. For each file there is a file control 

The SQL executor 120 also preferably includes a set of K1 , t/tn u- u * ■ ^1 j * M j r 

f 1 * 1 * .l * 1 • bloc}c which contains a file identifier 202, and a list of 

data structures for keeping track of the tasks it is executing, ' « . , , - Ajl „ , . ' , /, , 

including session control blocks 204. Each session control block 

, . . identified in the list 204 represents a table access operator 

file control blocks 140, including one for each table 60 that is accessing the file. A session control block represents 

partition in the database; me execution of one leaf node of an cxecution lrce . 

transaction control blocks 142, including one for each ^ discussed above with respect to FIG. 5, the execution 

active transaction; 0 f a S q L statement can include execution of a transaction, 

session control blocks 144, including one for each active or a sequence of transactions. While a transaction associated 

scan operator (i.e., leaf node in an execution tree); 65 with a statement is active, the statement may access multiple 

various scheduler queues 146 for keeping track of files, which may be on one or more nodes of the system, 

blocked, waiting and active tasks; and When discussing transactions, it is often convenient to say 
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that a transaction accesses various files to denote the access Scheduler queue 270 (FIG. 9E) is a list of session control 

operations done on behalf of the transaction, even through blocks for table access operators that are performing delta 

file access is actually being handled by the leaf node scans and have been rescheduled for execution because a 

operators of a statement. For ease of discussion, this tcrmi- blocking transaction has committed its results to durable 

nology will be used here. 5 storage. Each such table access operator is identified in the 

When a statement is executed in the context of a list 270 by a pointer to the corresponding session control 

transaction, there is a distinct transaction control block 142 block. 

(FIG. 9B) for that transaction on each node of the system for Scheduler queue 272 (FIG. 9F) is a list of session control 

which the statement has a table access operator. The trans- blocks for table access operators whose progress has been 

action control block 142 includes a transaction identifier 10 blocked, usually by a locked row in an conflicting mode 

field 212 and a list 214 of files affected by the identified (e.g., by a row for which there is write lock held by another 

transaction. The list 214 consists of a list of file identifiers transaction), or whose execution has otherwise been tem- 

216, which are implemented as file numbers in an exemplary porarily halted, but which will need to be rescheduled for 

embodiment. execution in the future. 

For each table access operator (i.e., active leaf node), 15 The scheduler may also contain numerous other lists and 
there is a session control block 144 (FIG. 9C). The session data structures that are not directly relevant to the present 
control block 144 includes scan information 222 about the discussion, but which are nevertheless essential to its opera- 
status of a scan, as well a delta scan list 250 that identifies tion. Since these data structures are well known to those 
table rows to be processed after the conclusion of a regular skilled in the art of designing SQL execution engines, they 
scan. It should be noted that the session control block 144 is 20 will be mentioned only to the extent they are relevant to 
a particular type of task control block 170 (FIG. 8), and thus explanation of the preferred embodiments of the present 
contains the same type of header information (171, 173, 174, invention. 

175) shown in FIG. 8, but not repeated here in FIG. 9C. The To best understand these data structures, their use will be 
definition block pointer 171 of each session control block explained with respect to the execution of a set of transac- 
points to the definition block (which may include compiled 25 tions. 
code) that is executed by the table access operator corre- 
sponding to the session control block. Scan Operator Procedure with Streaming Mode 

The scan information portion of the session control block . M1 iL . . 

1 j .urn * £ u First, we will assume that one or more streaming mode 

includes the following fields: . , . ^ , , . A . . & 

& queries are being performed, and that their transaction 

a filenum 224, which identifies the file (table partition) 30 contn)l bk)cks and bbcks faave ^ 

accessed by the table access operator; md kept up {0 dMe However, it should be noted that 

a BeginKey 226 and EndKey 228, which together specify a session control block 220 is used for all table access 

a range of records to be scanned by the table access operators, not just those in streaming mode. Further, some of 

operator; the mode flags, such as the ReverseScan and Skip Conflict 

a CurrentKey 230, which represents the key value of mode flags are applicable even when a non-streaming mode 

record currently being accessed by the table access scan is being performed. 

operator; application level programs that initiate streaming 
a ReverseScan flag 232, which indicates whether the file queries may use commands such as "publish" and "sub- 
is being scanned by the table access operator in reverse scribe" to insert or update data into a stream of data, and to 
index order; receive the data in that stream, respectively. The publish 
a DeltaScanStarted flag 234, which indicates whether the command is converted by the application into an ordinary 
table access operator has already accessed all data in SQL insert or update command, which is then compiled and 
the defined range, and is now in streaming mode; executed by the transaction processing system, 
a DeltaScan flag 236, which indicates whether or not 45 Referring to FIG. 10A, the scan operator procedure starts 
streaming mode is enabled for the table access opera- by establishing and initializing a session control block for a 
tor; and table access operator (300). All the fields of the session 
a Skip Conflict mode flag 240, which indicates how the control block shown in FIG. 9C are initialized. The DeltaS- 
table access operator is to respond when it attempts to canStarted field is always initialized to a value of "False" 
access a locked record in a conflicting mode. 50 because each table access operator begins with a regular 
Delta scan list 250 identifies table rows that the table sc* 11 * a nd only performs a delta scan (also called a secondary 
access operator needs to process at a later time. Each entry scan) after the regular scan is completed, 
in the delta scan list 250 includes a key 252 and a transaction Next, the scan operator procedure performs a regular scan 
identifier 254. The key is the index value of row that has of the identified table (302). Various details of this step will 
been modified by identified transaction. For instance, a table 55 be explained below. However, the basic aspects of a regular 
access operator may modify a row by inserting the row in the scan are as follows. The scan starts at the begin or end of the 
table, or by updating the contents of the row. range, depending on whether a reverse scan is being 
The DeltaScan waiter list 260 (FIG. 9D) is a list of session performed, and proceeds through the table until either a 
control blocks for table access operators that are performing locked row is encountered, or the end of the scan range is 
delta scans, but whose progress has been blocked by pending 60 reached. A row may be locked (in a conflicting mode) when 
transactions, or more generally by a lack of available quali- it is in use by a concurrent transaction other than the 
fying tuples to process. Each item 262 in the list 260 transaction associated with the current table access operator, 
includes a pointer to a session control block. Optionally, For instance, a transaction may lock a row to ensure that the 
each item 262 may also include the file number for the values in the row do not change until the transaction 
session control block referenced by that item, for ease of 65 completes, or because the contents of the row have been 
reference when determining which table access operators on modified but not yet committed. In any case, the scan 
the DeltaScan waiter list are eligible for resuming execution. operator procedure halts when it reaches a locked row, 
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unless the Skip Conflict flag 240 (FIG. 9C) is set, in which 
case the procedure skips that row and continues processing 
with the next row. More details of the Skip Conflict mode of 
operation are discussed below. 

When a table access operator is blocked by a locked row, 5 
the associated scan is temporarily halted and the request is 
added to a queue 270 (FIG. 9F) of blocked table access 
operators. The scheduler for the disk process automatically 
attempts to restart blocked table access operators when 
transactions commit or abort. When restarted, the table 10 
access operator resumes work at the row where it was last 
blocked. 

If the DeltaScan mode flag is set to False, indicating that 
the table access operator is not performing a streaming mode 
scan, the scan operator procedure sends an end of data 15 
message (304) to the parent node in the execution tree when 
the table access operator reaches the end of the scan range. 

If the DeltaScan mode flag is set to True, the scan operator 
procedure switches to DeltaScan mode when the table 
access operator reaches the end of the scan range (306). In 20 
particular, it sets the DeltaScanStarted flag in the session 
control block to True, and queues the session on tbe Del- 
taScan waiter list. 

Referring to FIG. 10B, the delta scan procedure is 25 
executed when a table access operator has finished its 
"regular scan" of the specified range of rows. The delta scan 
procedure processes any additional rows of data that are 
ready for processing, and then "goes to sleep" on the 
DeltaScan waiters list until more rows of data are ready for 3Q 
processing. Each time the delta scan procedure is executed 
for a table access operator, the procedure attempts to process 
the rows referenced in the delta scan list 250 (FIG. 9Q in the 
session control block for that table access operator. For each 
referenced row that is not locked, the row is processed and 35 
the corresponding key is removed from the delta scan list 
(310). 

When all the unlocked rows identified by the delta scan 
list have been processed, the session is put back on the 
DeltaScan waiters list (312). ^ 

The result sets created by stream mode table access 
operators of the present invention can be joined with the 
result sets of other table access operators, which enables 
efficient data processing. Basically, the result set returned by 
any leaf node, including a streaming mode leaf node, can be 45 
processed by higher level nodes of the execution tree. Thus, 
for example, a higher level node in the execution tree can 
perform a join operation on the result sets received from two 
or more leaf nodes. Further, the SQL compiler can optimize 
the execution of such join operations using existing compiler 50 
techniques and infrastructure. However, any node operator 
that would normally require an "end of data" from a lower 
node is modified so as operate on streams (i.e., without 
receiving an end of data). 

Transaction Update, Insert, Commit and Abort 55 
Procedures 

As discussed above, unlocked committed rows in the 
delta scan list of a table access operator need to be processed 
by the operator's delta scan. Below is an explanation of how 60 
updates and inserts performed by concurrent transactions are 
captured in delta scan lists, and how transaction termination 
(which causes locks to be released) make those rows avail- 
able for processing and also cause delta scans to be resched- 
uled for processing the newly available rows. 55 

Referring to FIG. 11, whenever an update or insert 
operation is performed, the corresponding disk file is 
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updated (320). However, if the update operation is being 
executed in the context of a transaction, the updates to the 
file are not permanent until the transaction commits and 
therefore the updated or inserted rows are locked to prevent 
other table access operators from using the information in 
these rows. 

Next, the insert and update operator procedure repeats the 
following steps (322, 324, 326) for each session control 
block listed in the file control block for the file whose 
contents were updated. First, the procedure determines if the 
operator for that session control block is affected by the table 
update (322). This determination is made by comparing the 
key(s) for the updated row(s) with (A) the full range of keys 
for the operator, and (B) with the remaining scan range for 
the table access operator. The remaining scan range is 
determined by inspecting the BeginKey, EndKey, Current - 
Key and ReverseScan fields of the session control block to 
determine a remaining scan range (i.e., a range of key 
values). 

If all the updated or inserted rows are either outside the 
range of keys for the table access operator, or are in the 
remaining scan range, the operator is not affected because 
either the operator does not need to process the updated or 
inserted rows, or it will do so automatically as the operator's 
scan progresses. If the table access operator is in streaming 
mode, and any of the rows that are updated or inserted are 
in the range of keys for the session but outside the remaining 
scan range, then the table access operator is affected because 
the operator will need to go back during the delta scan phase 
of its scan and process the updated or inserted rows. 
Similarly, if the table access operator is already in the delta 
scan phase, the operator is affected if any of the updated or 
inserted rows are in the key range for the operator. 

If the table access operator is not affected (320-No), the 
insert and update operator procedure processes the next 
session control block, if any, listed in the file control block 
(324). If the table access operator is affected (320-Yes), the 
key(s) of all the updated or inserted rows that can affect the 
table access operator are added to the delta scan list of the 
operator's session control block (326). Stated more simply, 
for each updated or inserted row that is not in the remaining 
scan range, but is in the range of keys for the table access 
operator, the key and transaction ID are added to the delta 
scan list of the session control block. As indicated, the 
transaction ID for the transaction that updated or inserted 
those rows is stored along with each key added to the delta 
scan list. 

Referring to FIG. 12, when a transaction commits, the 
transaction releases the row locks on the rows that the 
transaction has updated or inserted (330). By releasing the 
row locks, the changes to the database are committed and 
made available to other transactions. When a transaction 
commits and the row locks have been released, the scheduler 
receives a list of files that have been updated. It then inspects 
the DeltaScan waiters list and reschedules all requests (i.e., 
table access operators) in the list which correspond to the 
files in that list (332). Typically, pointers to the session 
control blocks for the rescheduled requests are added to a 
queue of requests for which further processing will resume 
when the scheduler is able to allocate resources to them. 

As a result, table access operators in delta scan mode that 
have stalled due to a lack of available, qualifying rows are 
automatically rescheduled for execution whenever a trans- 
action that may affect the availability of qualifying rows 
terminates. 

Referring to FIG. 13, when a transaction aborts, the 
transaction (i.e., the corresponding table access operators) 
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rolls back (undoes) the table updates made by the transaction not locked in a conflicting mode, the row is processed (364) 

(340). In addition, for each file affected by the aborted and the scan procedure then searches for a next row to 

transaction (as indicated by the file numbers listed in the process (360). However, if the next row found by the 

transaction's control block), the corresponding table access procedure is locked in a conflicting mode (362 -Y), what 

operator inspects the lists of keys for all corresponding 5 happens next depends on the mode of operation of the scan 

session control blocks (as indicated by the session control operator. If the scan operator's session is not in "Skip 

blocks listed in the file control blocks) and deletes all key Conflict" mode (366), the scan operator halts when it 

entries whose transaction ID matches the ID of the aborted reaches a locked row (368). When a table access operator is 

transaction (342). In this way, table access operators in halted b y a locked row > ^ operator's session is added to a 

streaming mode will not attempt to process rows that were io ! ? ck wu ^^2£ ^ £ a different <l ueue from 

updated or inserted by aborted transactions. The aborted lhe T £ u k eue 270 ^ of blocked «f \ °£ rat ° rs -,, A 

transaction releases the row locks held by the transaction , If TSu^l* r^T!l " Sj?^ 

/*aa\ u *u . *u ^ u c ■* i - . j (as indicated by the Skip Conflict flag 240 in FIG. 9C) (366), 

^ )'}\ DeltaScan waiters list and ^ ^ fe ^ mode (3?0) ^ £ 

reschedules aU requests (i.e table access operators) in the ^ Qr ^ ^ row (3?2) ^ J ^ 

list which correspond to the files in that list (346). Resched- is ^ the nexl row m ihG delta scan list In additioQ) it adc g 

uhng those requests is necessary because (A) the aborted a key entry to the delta ^ M in the ses ^ on control block 

transaction may have held locks that blocked the progress of f or the skipped row. The key entry enables the scan operator 

other table access operators, and (B) the aborted transaction to process the skipped row after its row lock has been 

may modify tuples through use of the set of rollback feature, released, 

described below. 20 If the scan operator's session is in "Skip Conflict" mode 

Skip Conflict Mode of Operation ( as ^cheated by the Skip Conflict flag 240 in FIG. 9C) (366), 

AK1 . a ^, . c . . - - and the session is not in streaming mode (370), then the scan 

A skip contact mode ot operation is provided to enable operator skips that row (374) and continues processing with 

a table access operator to process rows not locked in a the next row since lhe ^ or fc ^ jn slre ^ mi 

conflicting mode, without being blocked by rows that are ~ c _ i t . , ,. , . . , c ... ° 

. . , . & ' . . . , . • , . 25 mode, the delta scan list is not used for this scan operator, 

locked in a conflictmg mode (i.e., locked in a mode that n „A a„ •„ i- > r 

, & > v . ^ " and no entry is made in the delta scan list for the session 

prevents access to the rows by the operator). The skip control block 

conflict mode of operation is ^useful when processing records ^ ski Conflict m&de of tion ^ be 

quickly is important, but the order in which records are discuss( , d bdow ^ ^j^^ ^ the ddele tions . 

processed is not important. Further, the skip conflict mode of 30 

operation is useful in conjunction with update and delete Streaming Access to Partitioned Tables 
operations embedded in select statements, which will be An important aspect of Compaq/Tandem transaction pro- 
discussed in more detail below. cessing systems is that they are fault tolerant and scalable. 

The skip conflict mode of operation differs from tradi- Transparent horizontal partitioning of large database tables 

tional "browse" access. Browse access allows a transaction 35 is used to scale and distribute computational loads across the 

to read tuples on which another transaction has a lock in an nodes and devices in a cluster. "Horizontal partitioning" 

incompatible mode. For instance, even though the other means that a table is divided or partitioned into two or more 

transaction might have a write lock on a tuple, a transaction files, with each partition storing the records (tuples) having 

with browse access is allowed to read the tuple. Since the a corresponding range of key values. For example, each of 

data read using browse access may be internally 40 two or more partitions of a customer table might be used for 

inconsistent, any application using browse access must have records corresponding to customer names starting with 

appropriate safeguards to handle such internally inconsistent corresponding range of letters A-L. Tables can also be 

data. The skip conflict mode of operation differs from partitioned based on a hash value. For example, using a hash 

browse access in that skip conflict causes the table access function of a record index can have any of N (e.g., 4) values, 

operator to skip over tuples locked in an incompatible mode, 45 a table can be divided into N partitions, with the hash 

while browse access causes the operator to return tuple function being used to determine the partition in which each 

uncommitted values even when the tuples are locked in an table row is stored. Further, each partition may be stored on 

incompatible mode by another transaction. In addition skip a different node of the system to facilitate distribution of the 

conflict mode, when used in conjunction with streaming computational load on the system. 

access mode, keeps track of the skipped tuples so that they 50 The present invention extends the horizontal partitioning 

can be accessed during the delta scan, after the locks on the of database tables to queue and publication channels, and 

skipped tuples have been released. uses partitioning for data dependent routing and load distri- 

The preferred extended SQL syntax for using the skip bution. Thus, the present invention enables the use of 

conflict mode of operation is: horizontally partitioned queues and publication channels. 

select * from table where ... for skip conflict access 55 This is best explained by example. 

Referring to FIG. 14, the file scan procedure shown is for Referring to FIG. 15, when a leaf node of an execution 

performing a regular scan of a file, and corresponds gener- tree corresponds to a partitioned table, the leaf node scan 

ally to step 302 of FIG. 10A. Note that the file scan operator is replaced by a fan out operator 400 that is coupled 

procedure shown here does not include features of the to a partition scan operator 402 for each of the table 

invention which will be introduced later in this document. 60 partitions. During the first phase of a streaming mode scan, 

When first started, or after processing a row, the file scan sometimes called the regular scan phase, the function of the 

procedure finds a next row, if any, in the defined range of fan out operator 400 is quite simple. It directs data requests 

rows (360). Refer to FIG. 10A and the above discussion of to the partition scan operator for each of the table partitions, 

that figure for what happens when there are no more rows in The order in which such requests are directed to the partition 

the defined range. 65 scan operators may vary from one implementation to 

The scan procedure determines if the row is locked by another, and relates to aspects of database management 

another transaction (362) in a conflicting mode. If the row is systems not relevant to the present invention. 
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When all the partitions have completed the regular scan 
phase, the fan out operator sends additional requests to the 
partition scan operators in accordance with the type of scan 
being performed. The fan out operator can have more than 
one outstanding request pending at a time. Nowaited asyn- 5 
chronous operations are used for this purpose. 

It is noted that in one embodiment the fan out operator 
does not keep track of the scan phase (initial scan or delta 
scan) of its partition scan operators. Furthermore, the fan out 
operator does not attempt to retrieve rows from the parti- 10 
tioned table in any particular order. However, as a practical 
matter, initial scans by the partition scan operators will 
usually complete before many rows are retrieved from delta 
scans, because the partition scan operators in the delta scan 
phase will typically be suspended much of the time for lack 15 
of qualifying tuples, allowing the initial scans by the other 
partition scan operators to complete. 

In another embodiment the fan out operator includes a 
larger status table for keeping track of the key ranges 
associated with each partition scan operator, and more 20 
detailed status information about each partition scan opera- 
tor so as to be able to return rows to the calling application 
in a specified order. In this embodiment, the fan out operator 
extracts tuples from the partition scan operators in a order 
specified by the calling application during the initial scan. 25 

The partition scan operators support five types of requests 
from the fan out operator. One of these request types is used 
for non-streaming, read only access to a database, while a 
second request type is used for streaming, read only access. 
These two request types are defined as follows: 30 

1) Request Type 1: requests the partition scan operator to 
return all rows that satisfy the table access operator, and 
further requests that the partition scan operator not block 
if there isn't a row to return. In other words, when there 
are no rows to return, the partition scan operator returns 35 
a null set of rows to the fan out operator and does not wait 
for any further rows to be added to the partition. This type 

of request is typically used for non-streaming, read only 
access to a database table partition. The fan out operator 
sends this request to each table partition, and receives in 40 
response all records that satisfy the cursor. The request is 
non-blocking because the fan out operator does not want 
or need to receive records added to the table partition after 
the request is made. 

2) Request Type 2: requests the partition scan operator to 45 
return all rows, but blocks if there isn't one. This type of 
request is used for streaming, read only access (i.e., for 
streaming operators that do not delete or update tuples). 
This type of request is sent by the fan out operator to all 

of the partition scan operators so as to automatically so 
retrieve rows as they are inserted or updated in the table. 
Since this request type is used only for streaming access, 
this request type never returns an "end of data" indicator. 
Three additional types of request types are needed pri- 
marily for embedded delete or embedded update access to 55 
database tuples, which is discussed below. These request 
types are defined as follows: 

3) Request type 3: requests the partition scan operator to 
return one row that satisfies the cursor, and further 
requests that if there isn't a row to return the partition scan 60 
operator should block (i.e., wait) until a row satisfying the 
cursor is inserted into the table, or until such a row in the 
table is updated. In other words, the partition scan opera- 
tor resumes execution once further rows become avail- 
able. In a second implementation, this request type is used 65 
to request at least one, but no more than N, rows, where 

N is a parameter specified by the application fetching data 
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from the table. This request type is used for streaming, 
delete access and update access queries on non- 
partitioned tables, or if only a single partition is accessed 
by a query. 

4) Request type 4: requests the partition scan operator to 
return one row that satisfies the cursor, and further 
requests that if there isn't a row to return the partition scan 
operator should not block, but rather should return a null 
value. In a second implementation, this request type is 
used to request at least one, but no more than N, rows, 
where N is a parameter specified by the application 
fetching data from the table. This request type is used for 
streaming, delete access and update access queries on 
partitioned tables. It is also used for non-streaming, delete 
access and update access scans on both partitioned and 
non-partitioned tables. 

5) Request type 5: requests the partition scan operator to 
return a value indicating whether or not the partition scan 
operator has at least one row that satisfies the cursor, and 
further requests that if there isn't a row that satisfies the 
cursor, the partition scan operator should wait until a row 
satisfying the cursor is inserted into the table, or until such 
a row in the table is updated. This request type is used for 
streaming, delete access and update access scans on 
partitioned tables. 

Embedded Delete and Update Statements with 
Result Sets 

Normally, using prior art database systems, in order to 
update a tuple and then return it to a calling application (or 
to return a tuple to an application and then update it) requires 
two separate operations, an update operation followed by a 
select operation (or vice versa). Similarly, using prior art 
database systems, to both return a tuple to a calling appli- 
cation and delete the same tuple, requires two separate 
operations, a select followed by a delete operation. Using the 
present invention, delete or update operations can be embed- 
ded in a select statement, and thus can be performed in as a 
single combined operation, which is much more efficient. 

In a preferred embodiment of the present invention, delete 
and update statements are considered to be "embedded" in 
a select statement when a delete or update keyword is 
embedded in the "from" clause of a select statement, for 
example: 

select * from (delete from table) as t; 

A key issue concerning delete and update operations is to 
guarantee that transactions do not commit before all rows 
actually deleted or updated have been returned to the calling 
application. To avoid deleting or updating rows that are not 
returned to the calling application, the SQL executor pro* 
vides an interface that allows the application to specify how 
many rows are to be updated or deleted per fetch operation 
by the application. For instance, the five request types used 
by the partition scan operator and the fan out operator 
described in the previous section of this document enable the 
SQL executor to guarantee that all updated and deleted rows 
are returned to the calling application when the table that is 
being accessed is a partitioned table. 

The delete and update features of the present invention 
provide a destructive read capability and a "read modify 
write" capability in conjunction with streaming access to a 
database table. This allows queuing services to be provided 
by a relational database system while preserving the ability 
of the DBMS to perform other relational operators on the 
result set returned. Further, these features are useful in that 
they allow an application to combine delete, update and read 
operations into a single SQL statement, which reduces the 
cost of performing certain transactions. 
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Destructive reads are sometimes used to ensure that an transaction commit causes the effects of the delete or update 

item is processed exactly once. For instance, several "credit operations to be made permanent. A subsequence applica- 

evaluation" processes might be assigned the job of reading tion crash would cause the updated or deleted rows to be 

and processing credit applications. Each such process could lost, as they have been permanently updated or removed 

use a destructive read (i.e., delete operation with result set) 5 from the database by the committed transaction, but never 

to read a next credit application record for processing. In this delivered to the application. 

way, the credit evaluation processes work in parallel, with- ^ me result ^ Cfeated b me deJete m6 ^ access 

out interfering with each other while ensuring that each operations of lhe resent invention can be joined ^ tne 

credit application is processed once and only once. ^ ^ of ^ {Mc ^ M 

Destructive read capabilities are provided by some rela- 10 «- • , . . . T c j i j> 

tional database systems. However, such systems generally e ® Cient data Pressing through the use of delete and/or 

do not provide destructive read capabilities in conjunction update operations embedded in a query, 

with streams (i.e., notifications of newly received records). Furthermore, the ability to skip locked tuples, using the 

To provide similar functionality in such systems, the appli- ski P confllct access feature of ^ P resent invention, in 

cation program must submit and resubmit a query to the 15 combination with embedded delete and update operations 

system on a periodic basis so as to process newly inserted facilitates implementation of efficient transactional dequeu- 

and updated table entries. in g operations. 

In the preferred embodiment, only authorized users are Tn e application level enqueuing operations, which gen- 
granted delete or update access to database tables. An SQL erate or submit the tuples to be processed by dequeuing 
access privilege module 148 (FIG. 5) is used to store 20 operations, are translated into standard SQL insert and 
information about which users have delete or update access update statements, respectively. 

to specific tables, and to block delete or update access by Another example of a dequeuing operation that helps to 

unauthorized users. demonstrate both the new SQL syntax of the preferred 

At the application level, the destructive read function may embodiment and the usefulness of a combination of the 

be called "dequeuing" data from a queue. Thus, a dequeue 25 stream, delete and skip conflict features is as follows. A 

statement in an application program is converted into a credit check transaction may need to access a customer 

select query that uses the stream, skip conflict and embedded record Xo implement this, a select statement for continu- 

delete features of the present invention. That _ query is then ousl dequeuin g orders may be nested within a select 

submitted to the database system for compilation and execu- statement • ^ ^mtv records with orders, as follows: 

tion. 30 

An example of an SQL statement that would read and * from customer, (delete from stream(order) for 

delete only those tuples which are not currently read, skip conflict access) order 

updated, inserted or deleted by another concurrent transac- where order.customer#=customer.customer#; 

tion is as follows: [ n tne aD0V e example the statement, during execution, 

select * from (delete from t for skip conflict access) as t; 35 will skip all order entries that are locked by another trans- 

An example of a query that updates tuples not in use by action in a conflicting mode, 

another concurrent transactions is as follows: Other examples of select statements with embedded 

select * from (update t set dequeued="true" where update operations are as follows. The syntax "return new- 

dequeued< >"true" .fieldlist" is used in the from clause of a select statement to 

for skip conflict access) as t; indicate the updated field values that are to be returned to the 

The following are two examples of queries that use an calling application, while the syntax "return old. fieldlist" is 

ordinary database table as a transactional queue from which used in the from clause of a select statement to indicate the 

records are dequeued: before update field values that are to be returned to the 

select * from (delete from stream(t) for skip conflict 45 calling application. In a preferred embodiment, the return of 

access) as t; the updated tuple is implied. 

anci Two equivalent select statements that return updated tuple 

select * from (update stream(t) set dequeued="true" values to the calling application are: 

WDere select * from (update account set balance=balance+credit 

dequeued< >"true" for skip conflict access) as t; 50 return new.*) as account; 

As indicated, the keyword "stream" in the query is used an d 
to continuously dequeue items from a table, thereby creating 

a continuous stream of result tuples. * from ( u P da te account set balance-balance+ 

When returning deleted and updated rows to the :credit) as account; 

application, the SQL execution engine ensures that the 55 where ":credit" indicates a value provided by the calling 

transaction will not commit until all deleted or updated application. 

tuples have been retrieved by the application. This can be A select statement that returns both the before and after 

achieved by returning control to the application only after all tuple values is: 

selected rows have been returned. Alternately, the calling i^** / j . ..... L i 

t- „• i_ ,• select from (update account set balance«=balance+credit 

application can be given an explicit notification indicating 60 return new * u fl .au.v^aiau^ T u t uji 

that the fetch pipeline is empty, in which case any request to " * 

commit the transaction is rejected while the pipeline is not old.*) as account; 

empty. Other methodologies might be used in alternate A ***** statement returns a specific updated tuple 

embodiments of the invention. These techniques are used to nelcl va * ue 

prevent the application from committing the transaction 65 select * from (update account set balance=balance+credit 

while rows are "in transit," i.e., still in the fetch buffers 72 return 

(FIG. 4) between the scan operator and the application. The new.balance) as account; 
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A select statement that returns a specific tuple field before 446 to request the scan operator to block (i.e., wait) if there 

update value is: isn't a row in the table satisfying the cursor until such time 

select * from (update account set balance-balance+credit mat a qualifying row appears (i.e., is inserted) in the table. 

retura Example of Execution of Select Statement with 

old.balance) as account; 5 Embedded Delete and Stream Access for 

A select statement that returns a specific tuple field before Partitioned Table 

and after update is: Referring to FIGS. 15, 17A and 17B, when a destructive 

select * from (update account set balance-balance+credit rea d is applied to a partitioned table, a fan out operator node 

relurn 1Q is generated by the SQL executor to coordinate the row 

new.balance, old.balance) as account; fetches by the partition scan operators. 

The return clauses such as "return old.*" shown above can After compilation of the SQL statement with an embed- 
also be used with select statements having embedded delete ded delete, execution begins by the application process 
operations. "opening an SQL cursor" on the specified table (460). The 

The tuple fields returned by a select statement are defined J5 SQL executor uses the fan out operator for the specified 

in the definition blocks for the nodes of the execution tree. table to execute the cursor (462). Before the application 

The definition blocks for leaf nodes specify both an update fetches a row (470), the fan out operator "primes" the 

or delete operation and a set of tuple fields to be returned execution tree by sending a request of type 5 to each of the 

when the compiled statement is a select statement having an partition scan operators for the specified table. This request 

embedded delete or update operation. 2Q asks the partition scan operators to send a reply to the fan out 

Suppressing Pre-fetching operat ° r whe " the P**Mon scan operator has at least one 

° a row that satisfies the cursor (464). Each partiuon scan 

Referring to FIG. 4, the request queue 70 and a fetched operator that does not have any rows satisfying the cursor 

records queue 72 may be used by the transaction processing blocks until such time that it does have at least one row that 

system to pre-fetch records not yet requested by the appli- 25 satisfies the cursor. 

cation that submitted the query being processed. For The fan out operator receives zero or more reply messages 

instance, each node in the execution tree other than the leaf f rom tne partition scan operators, each reply indicating that 

nodes may be automatically configured to request as many a respective partition scan operator has at least one row that 

records as can be stored in the fetched records queue(s) 72 satisfies the cursor (466). The fan out operator keeps a record 

between it and its child or children nodes, even if such 3Q of which partition scan operators have indicated that they are 

records have not yet been requested by the application. ready t0 return a row ( 468 y Note that a partition ^ 

Pre-fetching can improve system performance, by making operator may indicate at step 466 that it has a row satisfying 

use of otherwise dormant system resources, and can improve the but that ^ instant later lhat row may ^ deleted 

system responsiveness by having data ready for the appli- by another scan operator working on behalf of another 

cation before it requests it. 35 ap pii C ation. Thus, such replies by partition scan operators 

However, unbound pre-fetching must be suppressed when are remembered, but not relied on by the fan out operator, 

executing an embedded delete or update statement. The when the application fetches a row using its cursor (470), 

application must control how many rows are to be affected me fan out operator determines if any partition scan opera- 

by the delete or update operation, and therefore the database tors have indicated that they are ready to return a row. If not, 

management system must only delete or update those ^ the fan out operator continues to wait for a reply from any 

records actually requested by the application. This is 0 f the partition scan operators (472). Once such a reply is 

achieved by the protocol described by example below. received, or if such a reply had been received prior to the 

Example of Execution of Select Statement with fetch by the application, the fan out operator selects a 

Embedded Delete, for Non -Partitioned Table partition scan operator that has indicated that it is ready to 

45 return a row. Any number of procedures might be used to 

Referring to FIG. 16, when an application requests a make this selection. For instance, while the scan is in regular 

dequeuing operation on a non-partitioned table, the dequeu- scan rao de, the selection criteria might be to select the scan 

ing operation is expressed as an SQL select statement using operator for the lowest index value portion of the table, and 

the "delete from t" syntax, described above. After compila- while the scan is in the delta scan mode the selection criteria 

tion of the SQL statement, execution begins by the appli- 50 might be a round robin selection criteria. The fan out 

cation process "opening an SQL cursor" on the specified operator sends a request to the selected partition scan 

table to be destructively read (440). Opening a cursor operator, requesting it to return one row that satisfies the 

activates the execution tree and primes the SQL executor for curS or, and to return a null value if there isn't a row to be 

fetching data from the specified table. The SQL executor returned (474). This is a non-blocking request, since the 

uses the leaf node scan operator for the specified table to 55 partition scan operator is being requested to return a reply 

execute the cursor (442). When the application fetches a row right away, the reply consisting of either a single row or a 

using its cursor (444), the fetch is passed to the operator null reply. 

node in the execution tree that is the parent of the scan If the selected partition scan operator does not have a row 

operator (446) That operator node sends a request (of t0 returnj me fan oul operator another ucs{ t0 the 

Request Type 4) to the scan operator, requesting the return 60 parlilion ^ operator n]s st ^ lhe 

of one row. See the description above, for Request Type 4. ution ^ n operator t0 xad a reply to ^ farj out ^^lor 

The scan operator returns one row to its parent node, for when the partition scan operator has at least one row that 
processing by the application, and deletes that row from the satisfies the cursor (476). In this way the fan out operator 
database table (448). The process then repeats when the will automatically receive a notification whenever the par- 
application fetches another row (444). 65 ut j on scan operator has at least one row lhat satisfies the 

If the select statement uses stream access mode, the parent cursor. Then the fan out operator returns to step 474 to select 

node will use Request Type 3 (see description, above) at step a partition scan operator from which to request a row. 
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If the selected partition scan operator does have a row to 
return, the selected partition scan operator returns one row 
(478) to the fan out operator for processing by the applica- 
tion (480), and deletes that row from the database table 
(478). At this point, the fan out operator assumes that the 5 
selected partition scan operator has at least one more row 
that satisfies the cursor, and continues operation at step 470. 
The above described fan-out process then repeats when the 
application fetches another row (470). 

In other implementations, the application may fetch mul- 10 
tiple rows (e.g., N rows) at a time. In those implementations, 
the fan out operator at step 474 will request the selected 
partition scan operator to send "up to N" rows. At step 478 
the partition scan operator will return anywhere from 1 to N 
rows, depending on how many rows it currently has that are 15 
ready for delivery to the requesting application. 

Setting Column Values Upon Transaction Abort 

Traditionally, when a transaction aborts, all changes to the 
database made by the transaction are undone, and the 
database is returned to a state that is indistinguishable from 
the transaction having never taken place. When there is a 
need to document or record the reason for aborting a 
transaction, the conventional method of handling this 
requirement is to return an abort reason code to the appli- 25 
cation program that started the transaction. However, when 
using transactional queues, client processes do not directly 
communicate with their servers. An abort of the server's 
transaction will just re -instantiate the request in the queue, 
causing another server to process the request anew. If the 30 
reason leading to the abort is not transient, any subsequent 
transaction will also abort. 

The Set on Rollback feature of the present invention 
enables ttj^ relational database facility to, for example, count 35 
the number of times an attempt has been made to dequeue 
and process a request. In particular, an SQL language 
extension has been made to change column values on abort. 
The language extension is the clause: 

set on rollback {column value change specification}. 40 

For example, a statement to dequeue orders from a table 
named "order" having the fields order#, customer# and 
aborts would be: 

select * from (delete from stream(order) set on rollback 
aborts=aborts+l for skip conflict access) as order; 45 

The "set on rollback" clause in the above example counts 
how often an attempt has been made to dequeue an entry. 

To implement the "set on rollback" feature, the SQL 
compiler is extended to accept "set on rollback" syntax and 
to generate item expressions for updating the "before image" 50 
of a column. The SQL compiler rejects any update on 
columns (i.e., within a set on rollback clause) that have 
dependencies such as constraints, triggers and indices that 
would require further actions that could not be executed at 
abort time. The compiler also rejects any update that may 55 
change the size of a column value, since that would interfere 
with physical (i.e., page level) recovery operations. There- 
fore updates to VARCHAR fields (in a "set on rollback" 
clause) are not supported. 

Referring to FIG. 18, the "set on rollback" is performed 60 
by modifying the before image of a column, before it is 
passed to the transaction log manager 500 for storage in the 
transaction log. More specifically, SQL compiler includes in 
the code 508 for any update, delete or insert operator 
(genetically herein called a table access operator) code 510 65 
for generating a before and after image for each modified 
and new tuple. In the enhanced SQL compiler of the present 
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invention the image generation code 510 includes code for 
updating one or more fields of the Before Image when the 
query being compiled includes a "set on rollback" clause 
that affects the table being accessed by this operator 508. 

When the Before and After Images are passed by the table 
access operator 508 to the transaction log manager 500, the 
Before Image contains one or more modified fields if the 
query being executed contained a corresponding "set on 
rollback" clause. The Before and After Images are stored by 
the transaction log manager 500 in a transaction log 502. 

In fact, the transaction log manager 500 maintains a log 
502 of every database record (row) update, insert and delete 
made by any transaction. Each entry in the log 502 includes 
a transaction ID, for identifying the transaction that modified 
the database table, a timestamp, a "before image" of the 
database record prior to the update, and an "after image" of 
the record after the update. Other information may also be 
stored in the transaction log that is not relevant to the present 
discussion. 

If a transaction using a query is aborted, the transaction 
management facility 122 restores to the relevant database 
table the Before Image for each tuple updated or deleted by 
the transaction. If the query used by the transaction included 
a "set on rollback" clause affecting that table, the Before 
Image that is restored will include one or more modified 
fields. 

In order to properly support streaming scan mode, "set on 
rollback" operations that modify the before image of a tuple 
are treated as update operations. More specifically, when 
modifying the before image of a tuple, the key for the tuple 
is inserted into the delta-scan key list of any scan operators 
affected by the tuple update, using the same procedure as the 
one discussed above with reference to FIG, 11. The inserted 
keys are removed from the delta-scan key lists when the 
transaction commits, but are retained if the transaction 
aborts. Thus, in FIG. 12, a step is added to the commit 
transaction procedure for removing keys from key lists for 
tuples that were updated only by a "set on rollback" clause. 
For tuples updated by a standard set clause in an SQL 
statement, executed within the context of the transaction that 
is committing, the corresponding keys are retained in any 
key lists in which they were inserted. 

Alternate Embodiments 

The present invention can be implemented as a computer 
program product that includes a computer program mecha- 
nism embedded in a computer readable storage medium. For 
instance, the computer program product could contain the 
program modules shown in FIG. 5. These program modules 
may be stored on a CDROM, magnetic disk storage product, 
or any other computer readable data or program storage 
product. The software modules in the computer program 
product may also be distributed electronically, via the Inter- 
net or otherwise, by transmission of a computer data signal 
(in which the software modules are embedded) on a carrier 
wave. 

While the present invention has been described with 
reference to a few specific embodiments, the description is 
illustrative of the invention and is not to be construed as 
limiting the invention. Various modifications may occur to 
those skilled in the art without departing from the true spirit 
and scope of the invention as defined by the appended 
claims. 

What is claimed is: 

1. A method of processing data in a database management 
system, in accordance with a specified statement, compris- 
ing: 
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starling execution of a transaction; 

establishing a table access operator for accessing a 

defined range of rows in a database table; 
using the table access operator, performing an update or 

delete operation on qualifying rows, if any, from the 5 

database table, in accordance with the specified state - 

mem; 

for each updated or deleted row, storing a before image of 
the row, the before image representing field values of 
the row prior to the update or delete operation; 10 

when the specified statement includes a predefined set on 
rollback clause that specifies updates to be performed, 
in the event of a transaction rollback, to one or more 
fields of each row updated or deleted by the transaction, J5 
modifying the before image of said each updated or 
deleted row, prior to the before image being stored, by 
performing the specified updates to the one or more 
fields of the row; 

in the event that execution of the transaction is aborted, 20 
performing a rollback in which the stored before 
images of the updated or deleted rows are restored to 
the database table. 

2. A method of processing data in a database management 
system, comprising: 25 

starting execution of a transaction; 

establishing a table access operator for accessing a 
defined range of rows in a database table in accordance 
with a specified statement, the specified statement 
including a clause indicating modifications that are to 30 
be performed, in the event of a transaction rollback, to 
one or more fields of each row that is updated or deleted 
by the transaction; 

using the table access operator, performing an update or 
delete operation on qualifying rows, if any, from the 35 
database table, in accordance with the specified state- 
ment; and 

in the event that execution of the transaction is aborted: 
performing a rollback in which the updated or deleted 
rows are restored to the database table so as to 40 
reverse the update or delete operation performed on 
the updated or deleted rows, and 
modifying one or more fields of each row that is 
restored to the database table in accordance with the 
clause of the specified statement indicating the modi- 45 
fications that are to be performed upon rollback of 
the transaction. 

3, A database management system, comprising: 

a transaction management facility for managing execution 
of transactions; 50 

a database, including a plurality of tables, each table 
storing data in one or more rows; 

an execution engine for performing an update or delete 
operation on qualifying rows, if any, from one of the 55 
database tables, in accordance with the specified state- 
ment; 

a transaction log manager for storing, for each updated or 
deleted row, a before image of the row, the before 
image representing field values of the row prior to the 60 
update or delete operation; 

before image generation means, operative when the speci- 
fied statement includes a predefined set on rollback 
clause that specifies updates to be performed, in the 
event of a transaction rollback, to one or more fields of 65 
each row updated or deleted by the transaction, for 
modifying the before image of said each updated or 
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deleted row, prior to the before image being stored, by 
performing the specified updates to the one or more 
fields of the row; 
the transaction management facility including means for 
performing a rollback in the event that execution of the 
transaction is aborted, so as to restore to the database 
table the stored before images of the updated or deleted 
rows, 

4. A database management system, comprising: 

a transaction management facility for managing execution 
of transactions; 

a database, including a plurality of tables, each table 
storing data in one or more rows; 

an execution engine for performing an update or delete 
operation on qualifying rows, if any, from one of the 
database table, in accordance with a specified 
statement, the specified statement including a clause 
indicating modifications that are to be performed, in the 
event of a transaction rollback, to one or more fields of 
each row that is updated or deleted by the transaction; 

the transaction management facility including means for: 
performing a rollback in the event that execution of the 
transaction is aborted, so as to reverse the update or 
delete operation performed on the updated or deleted 
rows, and 

modifying one or more fields of each row that is 
restored to the database table in accordance with the 
clause of the specified statement indicating the modi- 
fications that are to be performed upon rollback of 
the transaction. 

5. A computer program product for use in conjunction 
with a computer system, the computer system including a 
database, including a plurality of tables, each table storing 
data in one or more rows; the computer program product 
comprising a computer readable storage medium and a 
computer program mechanism embedded therein, the com- 
puter program mechanism comprising: 

a transaction management facility for managing execution 
of transactions; 

an execution engine for performing an update or delete 
operation on qualifying rows, if any, from one of the 
database tables, in accordance with the specified state- 
ment; 

a transaction log manager for storing, for each updated or 
deleted row, a before image of the row, the before 
image representing field values of the row prior to the 
update or delete operation; 

before image generation means, operative when the speci- 
fied statement includes a predefined set on rollback 
clause that specifies updates to be performed, in the 
event of a transaction rollback, to one or more fields of 
each row updated or deleted by the transaction, for 
modifying the before image of said each updated or 
deleted row, prior to the before image being stored, by 
performing the specified updates to the one or more 
fields of the row; 

the transaction management facility including means for 
performing a rollback in the event that execution of the 
transaction is aborted, so as to restore to the database 
table the stored before images of the updated or deleted 
rows. 

6. A computer program product for use in conjunction 
with a computer system, the computer system including a 
database, including a plurality of tables, each table storing 
data in one or more rows; the computer program product 
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comprising a computer readable storage medium and a 
computer program mechanism embedded therein, the com- 
puter program mechanism comprising: 

a transaction management facility for managing execution 

of transactions; 
an execution engine for performing an update or delete 
operation on qualifying rows, if any, from one of the 
database table, in accordance with a specified 
statement, the specified statement including a clause 
indicating modifications that are to be performed, in the 
event of a transaction rollback, to one or more fields of 
each row that is updated or deleted by the transaction; 
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the transaction management facility including means for: 
performing a rollback in the event that execution of the 
transaction is aborted, so as to reverse the update or 
delete operation performed on the updated or deleted 
rows, and 

modifying one or more fields of each row that is restored 
to the database table in accordance with the clause of 
the specified statement indicating the modifications that 
are to be performed upon rollback of the transaction. 
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